Interlocked synchronous pipeline clock gating

ABSTRACT

An integrated circuit including a pipeline and a method of operating the pipeline. Each stage of the pipeline is triggered by one or more triggering events and is individually, and selectively, stalled by a stall signal. For each stage a stall signal, delayed with respect to the stall signal of a downstream stage, is generated and used to select whether the pipeline stage in question is triggered. A data valid signal propagating with valid data adds further selection, such that only stages with valid data are stalled.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is a divisional application of 11/376,544 filed Mar. 14, 2006 now U.S. Pat. No. 7,308,593, entitled “INTERLOCKED SYNCHRONOUS PIPELINE CLOCK GATING” to Hans JACOBSON et al., which was a divisional of Ser. No. 10/262,769 filed Oct. 2, 2002 now U.S. Pat. No. 7,065,665, entitled “INTERLOCKED SYNCHRONOUS PIPELINE CLOCK GATING” to Hans JACOBSON et al., issued Jun. 20, 2006, and related to U.S. Pat. No. 7,475,227, entitled “INTERLOCKED SYNCHRONOUS PIPELINE CLOCK GATING” to Hans JACOBSON et al., all assigned to the assignee of the present invention and incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to integrated circuits and more particularly to controlling data propagation in a pipeline circuit.

2. Description of the Related Art

Semiconductor technology and chip manufacturing advances have resulted in a steady increase of on-chip clock frequencies, the number of transistors on a single chip and the die size itself, accompanied by a corresponding decrease in chip supply voltage. Generally, the power consumed by a given clocked unit (e.g., latch, register, register file, functional unit, etc.) increases linearly with the frequency of switching within the unit. Thus, notwithstanding the decrease of chip supply voltage, chip power consumption has increased as well. In current microprocessor designs, over 70% of the power consumed is attributable to the clock alone. Typically, over 90% of this power is consumed in local clock splitters/drivers and latches.

Both at the chip and system levels, cooling and packaging costs have escalated as a natural result of this increase in chip power. It is crucial for low end systems (e.g., handhelds, portable and mobile systems) to reduce net energy consumption to extend battery life, but without degrading performance to unacceptable levels. Thus, the increase in microprocessor power dissipation has become a major stumbling block for future performance gains.

Accordingly, clock gating techniques that selectively stop functional unit clocks have become the primary approach to reducing clock power. Typically, clock gating is applied in an ad hoc fashion, which makes verification and clock skew management difficult. This is not expected to abate with ever larger and more complex designs unless a clearly defined and structured clock gating approach is developed.

A typical state of the art synchronous pipeline includes multiple stages, at least some of which may be separated by logic, each stage including an N latch register, at least one latch for each data bit propagating down the pipeline, and all of the stages synchronously clocked by a single global clock. A simple example of a pipeline is a first-in first-out (FIFO) register. A FIFO is an M stage by N bit register file, typically used as an M-clock cycle delay. Each cycle the FIFO receives an N-bit word from input logic and passes an M-cycle old, N-bit word to output logic. On each clock cycle (i.e., every other rising or falling clock edge) each N-bit word in the FIFO advances one stage. Typical examples of much more complex synchronous pipelines include state of the art microprocessors or functional units (e.g., an I-unit or an E-unit) within a state of the art microprocessor.

Traditionally, synchronous pipelines have been stalled globally, where all stages of either the entire pipeline, or a multistage unit, are stalled at the same time. However, cycle time and switching current constraints limit the number of stages that can be stalled during the same cycle. A difficulty with progressively stalling synchronous pipelines is that data is lost at stall boundaries. Further, as wire delays increase and become a concern, propagating a stall signal throughout a unit or between units, for example, may cause excessive signal delay, both from long wires and signal buffering requirements. Heretofore, achieving local clock gating based on stall conditions has not been possible because stalled data may be overwritten by data progressing through the pipeline from an earlier stage.

FIG. 1A shows an example of a four stage portion of a synchronous pipeline 10 (e.g., in the middle of a FIFO or in a microprocessor) at stages 12, 14, 16, 18 holding data items D, C, B, A, respectively. A stall boundary 20 indicates a point in the pipeline 10 where, because of placement and cycle time constraints, the next clock edge arrives at upstream stages before stall signal 22, thus providing insufficient time to disable the clock at those upstream stages. The stall signal 22 reaches downstream stage 16 and subsequent stages (not shown) with sufficient disable time, so those stages halt correctly. Because stages 12, 14 and stages upstream of the boundary 20 do not receive the stall signal in time, they incorrectly latch new data on the clock edge, potentially losing data that should be held there. So, in this example stages 16 and 18 are stalled, trapping data items B and A, respectively. Stages 12, 14, however, do not see the stall signal in time and therefore latch data items E and D in the next clock cycle. Consequently, data item C is overwritten and lost, instead of being held in stage 14, which should have stalled.
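The data loss at a stall boundary can be reproduced with a few lines of Python. This is only a behavioral sketch of the FIG. 1A scenario: the function name `clock_edge` and the list-based pipeline model are assumptions made for illustration, with list index 0 standing for stage 12 and index 3 for stage 18.

```python
def clock_edge(stages, incoming, stalled):
    """Advance a simple synchronous pipeline by one clock edge.

    stages   : data items held by each stage, index 0 is the most upstream
    incoming : item presented at the pipeline input
    stalled  : set of stage indices whose clock was disabled in time
    """
    nxt = list(stages)
    for i in range(len(stages)):
        if i in stalled:
            continue  # clock gated: the stage keeps its item
        # An unstalled stage latches whatever its upstream neighbor held.
        nxt[i] = incoming if i == 0 else stages[i - 1]
    return nxt


before = ["D", "C", "B", "A"]          # stages 12, 14, 16, 18
# The stall signal reaches only stages 16 and 18 (indices 2 and 3) in time.
after = clock_edge(before, "E", stalled={2, 3})
lost = set(before) - set(after)
print(after, lost)
```

In this model stages 12 and 14 latch E and D while stages 16 and 18 hold B and A, so item C no longer exists anywhere in the pipeline.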

FIG. 1B shows a traditional approach to handling progressive stalls wherein buffer stages 23 (often referred to as staging latches) are inserted in parallel to the pipeline at selected stall boundaries, e.g., 20. During a stall the staging latches 23 temporarily store data that would otherwise be overwritten. Unfortunately, because staging latches 23 add area, power, and delay overhead, stalls have traditionally been performed at a coarse level, i.e., staging latches are only at predicted stall boundaries. However, as noted above for globally propagated stall signals, increased wire delays, increased load on the stall signal from increasing the number of latches to achieve deeper pipelines (more stages) and demand for shorter cycle time combine to restrict how far the stall signal can propagate before it impacts cycle time. So, providing staging latches at a finer granularity, e.g., for stalling stage by stage, introduces extra buffer stages that double the number of latches in a pipeline. Clearly, the added staging latch area and power as well as increased chip complexity render this solution impractical at other than a very coarse granularity.

Thus, there exists a need for fine grained pipeline stage level clock gating for synchronous pipelines, where the decision whether or not to gate the clock can be made local to each stage rather than at the global level, while avoiding costly extra buffers.

SUMMARY OF THE INVENTION

It is a purpose of the invention to minimize clock power in synchronous designs;

It is another purpose of the invention to increase clock gating flexibility;

It is yet another purpose of the invention to improve pipeline clock control signal slack;

It is yet another purpose of the invention to reduce synchronous logic design effort with a natural, clearly defined and structured approach to clock gating;

It is yet another purpose of the invention to progressively stall high frequency pipelines without using staging latches or data hold muxes;

It is yet another purpose of the invention to increase effective pipeline storage capacity;

It is yet another purpose of the invention to increase storage capacity in queue structures.

The present invention is an integrated circuit including a pipeline and a method of operating the pipeline. Each stage of the pipeline is triggered by a trigger event and individually, selectively stalled by a stall signal. For each stage a stall signal, delayed with respect to the stall signal of a downstream stage, is generated and used to select whether the pipeline stage in question is triggered. A data valid signal propagating with valid data adds further selection, such that only stages with valid data are stalled.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of illustrative embodiments of the invention with reference to the drawings, in which:

FIG. 1A shows a synchronous pipeline with a stall between two stages;

FIG. 1B shows a traditional approach to handling progressive stalls wherein buffer stages (staging latches) are inserted in parallel to the pipeline at stall boundaries;

FIG. 2A shows a first preferred embodiment in a progressively stalled interlocked pipeline with distributed handshake and trigger event control logic;

FIG. 2B shows a preferred embodiment in a progressively stalled interlocked pipeline with centralized handshake and trigger event control logic;

FIG. 2C illustrates cross communication between multiple upstream and downstream stages in a plurality of progressively stalled interlocked pipelines;

FIG. 2D is a flow diagram that illustrates how propagation of data is handled at the interfaces of an interlocked stage with two storage nodes;

FIG. 2E is a flow diagram that illustrates how propagation of data is handled at the interfaces of an interlocked stage with only one storage node;

FIG. 3A shows a representative example of a typical pair of series connected register stages illustrating a preferred embodiment Elastic Synchronous Pipeline (ESP);

FIG. 3B is a flow diagram showing how data passing through pipeline register stages of FIG. 3A may be paused upon detection of a stall condition in downstream stages;

FIG. 4A shows an example of a four stage, two phase split latch pipeline with stall latches at each stage, propagating the stall signal backward in the pipeline;

FIG. 4B is a corresponding timing diagram for the four stage, two phase pipeline of FIG. 4A;

FIG. 4C shows a sub-trace of the timing diagram example of FIG. 4B;

FIG. 5 shows an example of a four stage, two phase split latch pipeline wherein each register stage has a valid bit latch;

FIG. 6A shows a representative example of a typical pair of series connected register stages illustrating an interlocked synchronous pipeline (ISP) preferred embodiment;

FIG. 6B is a flow diagram showing how data passing through pipeline register stages of FIG. 6A may be paused upon detection of a stall condition;

FIG. 7A shows an example of a four stage, two phase split latch synchronous pipeline with early valid, each stage including both an internal interlocking stall latch and a valid data latch;

FIG. 7B is a corresponding timing diagram for a four stage, two phase synchronous pipeline as in FIG. 7A;

FIG. 7C is a detailed sub trace of a section of the timing diagram of FIG. 7B;

FIG. 8 shows an example of a two phase clocked non-split latch master/slave stage pair ISP embodiment;

FIGS. 9A-B show an example of a pulsed transparent master/slave pair for a second pulsed mode ISP embodiment with further reduced clock power;

FIG. 10 shows an application of the ISP embodiment of the present invention to a 1 to 2 fork stage;

FIG. 11 shows an application of the ISP embodiment of the present invention to a 1 to 1-of-2 branch stage;

FIG. 12 shows an application of the ISP embodiment of the present invention to a 2 to 1 join stage;

FIG. 13 shows an application of the ISP embodiment of the present invention to a 1-of-2 to 1 select stage, where stage 2 has priority over stage 1;

FIG. 14 shows an application of the ISP embodiment of the present invention to a multicycle pipeline.

TERM DEFINITIONS

-   PIPELINE CLOCKING: Synchronously clocked pipelines are well known in the art. Locally clocked pipelines are also well known in the art and may, for example, be implemented as synchronous islands which interface either through handshake techniques as in Globally Asynchronous Locally Synchronous (GALS) approaches or through Phase Locked Loop (PLL) based synchronization techniques.

-   DATA: We define data as any information that is present in an integrated circuit. This includes, but is not limited to, data signals in a data-path and control signals in a control-path.

-   PIPELINE STAGE: We define a basic stage in a pipeline to contain a single layer of data storage nodes (where a layer is a collection of parallel storage nodes). In the present description a mentioned stage refers to this definition unless otherwise indicated either explicitly or by the context.

-   SPLIT LATCH PIPELINE: Since stages with two sequential layers of storage nodes are more frequently used in the art, pipelines that are made up of stages with only one layer of storage nodes are often referred to as using split latches.

-   DATA STORAGE NODE: A data storage node in a stage of a split latch pipeline can, for example, be one of the following: a transparent latch, a precharged domino logic, a precharged cross-coupled inverter latch, or a Set-Reset latch. Such latches and other variations are well known in the art.

-   PIPELINE DOUBLE-STAGE: In many modern VLSI designs, the storage nodes in two adjacent split latch stages are merged and together are considered to be a stage. This is frequently done in pipelines based on, for example, the following latch types: a master/slave latch or D flip-flop, or a sense-amplifier latch. Such latches and other variations containing two sequential data storage nodes are well known in the art.

-   DATA STORING: The act of storing data in mentioned storage nodes is performed in response to a triggering event. A node that stores data also inhibits new input data from passing through the node.

-   TRIGGER EVENT: The mentioned triggering event can, for example, be one of the following: a rising or falling edge on a global or local clock or signal, a pulse on a clock or other signal, an edge or a pulse on an asynchronous sequencing signal, or an edge or a pulse on a timing signal. Clocks, pulses, asynchronous sequencing signals, and other types of timing signals are well known in the art.

-   ALTERNATE EVENTS: Although not necessarily always so (e.g., for pulsed master-slave pipelines (elaborated later)), adjacent stages in split-latch pipelines typically store data on alternate triggering events. For example, a stage may store data on a falling edge of a global clock while its adjacent downstream stage stores data on the rising edge of the global clock. This is done to avoid data racing through two or more adjacent stages, thus making sure that data in the pipeline progresses in an orderly stage by stage lock step fashion. Stages containing two storage nodes often use a similar approach of storing data in adjacent storage nodes on alternate triggering events.

-   STALE DATA: A data item is said to be stale if it will not be used in subsequent computations. An example of stale data is data that is duplicatively stored in adjacent stages. Once the downstream stage stores the passed data, the same data held in the current stage becomes stale. This situation does not occur, for example, in two phase clocked pipelines where the edges on the two clocks overlap, as the data in the current stage is overwritten at the same time as the data is stored in the downstream stage. However, in two phase clocked pipelines with, for example, non-overlapping clock pulses, it is possible for both the current and downstream stages to momentarily store the same data. In such cases, the data in the current stage is considered stale.

-   LOGIC CIRCUIT: Logic circuits may reside between each of mentioned stages. Such logic circuits may, for example, compute a datapath function, a control function, or a function that gates one or more global or local triggering events. Such logic functions can also be used to produce valid and stall signals that are used to inhibit, or gate, triggering events.

-   STAGE COMPONENTS: A stage is an abstraction which may contain components in addition to the mentioned layer(s) of data storage nodes. Mentioned layer of storage nodes in a stage is also referred to as a register stage. A stage may contain a trigger event generator that, responsive to a stage triggering event, selectively produces local trigger events to the different components of a stage. A clock-splitter (or clock-block) is an example of such a trigger event generator that is well known in the art. Based on a set of inputs, such as signals indicating if arriving data is valid or not, and signals indicating if the stage needs to stall or not, the trigger event generator may selectively produce mentioned local trigger events. A stage may also contain the logic circuits for the generation and/or propagation of data valid and stall indications.

-   UPSTREAM/DOWNSTREAM: The terms downstream and upstream are named with respect to the direction data flows through the pipeline. A downstream stage may also be referred to as a subsequent stage. An upstream stage may also be referred to as a previous stage.

-   ADJACENT: When the term adjacent stages is used, this means that there is a direct communication, or connection, between the stages, without any other stages in between. A downstream adjacent stage can also be referred to as a next subsequent stage, and an upstream adjacent stage may also be referred to as a next previous stage. A stage can of course be adjacent to a plurality of downstream and upstream stages.

-   DELAYED SIGNALS: The basic operation of a pipeline is to delay data at each stage in the pipeline such that upstream data does not catch up with and overwrite downstream data. When it comes to indicating if data passed to a stage is valid or not, the associated valid indication, for example a valid bit propagating alongside the data, must be delayed along with the data. A downstream valid indication for a given data is therefore delayed with respect to an upstream valid indication for mentioned data. Similarly, in the progressive stalling techniques of the present invention, when it comes to indicating if a given stage should be stalled or not, the associated stall indication, for example a stall bit propagating in the opposite direction of the data, must also be delayed. Otherwise the stalling of stages would not be progressive, but rather coincident. How to coincidentally, or simultaneously, stall multiple stages is well known in the art. However, how to progressively stall a pipeline, other than asynchronous pipelines, one stage at a time as described herein, is novel. The present invention implements progressive stalling by delaying a stall indication of a stage with respect to the stall indication of a downstream stage.

-   DELAY CONDITIONS: A stall signal can be delayed in two ways. First, a stall indication to an upstream stage can be delayed until just before mentioned upstream stage is about to pass, or store, new data due to the arrival of a new triggering event. Second, a stall indication to an upstream stage can be delayed until valid data has been stored in mentioned upstream stage.

-   DELAY-TIME: The delay time with which a stall signal needs to be delayed is proportional to the time it will take before a next triggering event causes an upstream stage already storing data to pass, or store, new data. For example, in a two phase split latch pipeline where adjacent stages are triggered on alternate clock edges, the delay time for a stall indication is half a clock cycle. In pipelines where data is indicated as valid or not, the delay time is furthermore proportional to the time it will take until valid data arrives and has been stored in mentioned upstream stage. The delay time of a valid signal is proportional to the time it will take before a next triggering event causes a downstream stage to either store or pass new data depending on what type of latch and trigger event scheme is used. In, for example, a synchronous clocked pipeline the delay time needed with respect to a next triggering event is proportional to the global clock period. Depending on the latch and triggering event scheme used, the delay time can be either half a clock cycle or a full clock cycle. Note that the valid indication can be further delayed if the stage is stalled.

-   STALLING: We define stalling of a stage to mean that a data item is stored, and held, in the stage for current or later use, or potential use, for more time than required to move data through the stage during unobstructed propagation of data through the pipeline. Note that a stage is not considered stalled until the arrival of a stage triggering event that, if no stall condition was present, would have caused the stage to pass, or store, new data. Note that stalling may be more correctly referred to as pausing, as the propagation of the stalled data is momentarily paused while the stage is stalled.

-   INTERLOCK HANDSHAKE: When valid and stall indications are both present in a progressively stalled pipeline, the operation of each stage can be controlled through a valid-stall handshake protocol. The valid and stall indications are used in so called handshakes to signal if valid data is arriving to a stage and if the stage needs to stall. Such a handshake protocol ensures that a stage only passes data when the stage receives valid data and the stage does not need to stall. The handshake protocol also ensures that currently stored valid data is held until a downstream stage is ready to receive the data by indicating that it is not stalled. Using handshake protocols to control propagation of data between stages is a technique to interlock the operation of adjacent stages in the pipeline. In asynchronous pipelines a similar, but also substantially different (as elaborated later), stage interlocking concept is used through what is referred to as a request-acknowledge handshake protocol. Handshake protocols and stage interlocking of asynchronous pipelines are well known in the art. However, handshake protocols and techniques described herein that can provide interlocking between stages in pipelines other than asynchronous pipelines are novel. Storing more than one data item in a storage device containing a plurality of storage nodes, such as a master-slave, flip-flop, or sense-amplifier latch, as described herein is novel also in asynchronous pipelines.

-   NACKING PROTOCOL: The progressively stalled and interlocked pipelines of the present invention make use of a nacking stall protocol through their use of stall signals. Asynchronous pipelines make use of an acking stall protocol through their use of acknowledge signals. A nacking protocol indicates to a current stage that new data is not accepted by a downstream stage because the downstream stage is stalled. An acking protocol, in contrast, indicates to a stage that the data currently held in that stage has been stored by a downstream stage and new data can now be stored in the current stage. These protocols are substantially different. For example, asynchronous pipelines cannot operate solely on a nacking protocol as there is no signal to provide a time reference for when it is safe to pass a next data item through a stage without risking overwriting downstream data.

DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

A contribution of the present invention is to achieve interlocking between stages in pipelines other than asynchronous pipelines (although the techniques apply also to asynchronous pipelines). FIGS. 2A and 2B illustrate examples of two preferred embodiments. FIG. 2A illustrates an abstract view of an interlocked pipeline 30 where the interlocking handshakes are generated in a distributed fashion by a logic circuit (CCL) 31 and through control/datapath logic (DP/CTL) 32, local to each stage. FIG. 2B illustrates an abstract view of an interlocked pipeline where the interlocking handshakes are generated in a centralized fashion by a common logic circuit 39, such as a state machine. Although not illustrated in FIG. 2B, the common logic circuit 39 may of course be an abstraction of distributed control logic, and may of course receive control and data signals from each pipeline stage and the environment of the pipeline. The register stages (SD) 33 of each stage are triggered by local trigger events generated by the CCL 31/39 (sdtrig 34 in FIGS. 2A and 2B). The CCL is in turn triggered by one or more stage triggering events (strig 35 in FIG. 2A and FIG. 2B). For example, such stage triggering events would, in a synchronous pipeline, be those of a global clock, while in an asynchronous pipeline the triggering events would be the events of a request-acknowledge handshake. The handshakes between stages of the pipeline and between the pipeline and its environment are based on data valid 36 and stall 37 indications, or signals.

The improved storage properties of the present invention are applicable also to asynchronous pipelines. The common logic circuit 39 in FIG. 2B in that case contains a request-acknowledge handshake generation and distribution network that generates the sdtrig 34 events for each pipeline stage, and the valid 36 and stall 37 interface signals may be replaced by request and acknowledge signals if the environment is asynchronous. Although such an asynchronous pipeline already has the ability to progressively stall the pipeline by means of techniques in the art, the present invention can still provide improved storage if stages with two, or more, storage nodes, such as pulsed master-slave latches, flip-flops, or sense-amplifier latches, are used.

Note that the valid signal arrows, e.g., 36, are dashed to indicate that these are optional. In the present invention, a pipeline where no valid signals are present, but stall signals, e.g., 37, are present, implements a progressively stalled pipeline where stages of a pipeline are stalled stage by stage in a “cycle by cycle” fashion. In the present invention, a pipeline where valid and stall signals are both present implements an interlocked pipeline working substantially similarly to a progressively stalled pipeline, but which only passes valid data and only stalls a stage if it contains valid data.
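The difference between the two pipeline flavors can be written as two small Boolean conditions. The sketch below is one reading of the preceding paragraph, not the logic of any particular figure; the function names and the use of `None` to model an absent valid signal are assumptions made for illustration.

```python
def stage_stalls(downstream_stall, holds_valid=None):
    """Decide whether a stage must stall.

    holds_valid=None models a pipeline without valid signals
    (progressively stalled); otherwise the pipeline is interlocked.
    """
    if holds_valid is None:
        return downstream_stall
    # Interlocked: a stage stalls only if it actually contains valid data.
    return downstream_stall and holds_valid


def stage_passes_data(stall, valid_in=None):
    """Decide whether the stage's local trigger event should fire."""
    if valid_in is None:
        return not stall
    # Interlocked: only valid data is passed, and only when not stalled.
    return valid_in and not stall
```

For example, `stage_stalls(True, holds_valid=False)` evaluates to `False`: an empty stage in an interlocked pipeline can still absorb one more item even though its downstream neighbor is stalled.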

As illustrated by FIG. 2C, interlocking between stages can be achieved in general pipeline structures and not just linear pipelines. FIG. 2C illustrates a collection of interlocked pipeline stages 42, 43 that can communicate with each other by exchanging data and valid-stall handshake signals. In the illustrated structure, any input stage 42 can communicate with any output stage 43, and vice versa. The DP/CTL logic 44 can itself be a collection of pipelines. Each stage is triggered by one or more stage trigger events 45. In a synchronous integrated circuit, these stage trigger events would all be generated by the same global clock. In a locally clocked pipeline, some stage triggering events may be generated by different clocks than others. In an asynchronous pipeline the stage triggering events for a stage would be a request-acknowledge handshake for that stage performed on request and acknowledge signals replacing the shown valid and stall signals, rather than performed on the strig 45 signals. Again, note that the valid signals are optional, which is illustrated by valid signal arrows being dashed. The progressively stalled and interlocked pipelines of the present invention can be applied to a wide variety of integrated circuits such as, for example, microprocessors and ASICs.

The flowcharts in FIGS. 2D and 2E illustrate how stages in a pipeline can be stalled progressively, one stage at a time. The method for progressive stalling observes that in many pipelines where data moves in a lock step, or similar, fashion, only every other storage device actively stores data at any given time. This leaves every other storage device empty. These empty storage devices can thus be used as buffers that can be progressively filled with data during a stall condition. The result is that an indication that a downstream stage is stalled can propagate backwards in the pipeline in a delayed fashion such that, in a linear pipeline, at most two storage devices need to stall per computation cycle. In a two phase linear pipeline where alternate storage devices store data at alternate times, at most one storage device needs to stall per half-cycle.
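The buffering effect of the empty storage devices can be seen in a small behavioral simulation. The following is a simplified sketch under stated assumptions, not a model of any circuit in the figures: latches are indexed from the most upstream (0), latches of even index are triggered on even half-cycles and latches of odd index on odd half-cycles, each triggered latch samples the stall indication of its downstream neighbor at the same event on which it decides whether to store, and there are no valid signals. The name `simulate` and the stall schedule are made up for illustration.

```python
def simulate(n_latches, stall_in_schedule):
    """Simulate a linear two phase latch pipeline, one half-cycle per step.

    stall_in_schedule[t] is the stall indication presented by the
    environment downstream of the last latch during half-cycle t.
    Returns a list of (data, stall) snapshots, one per half-cycle.
    """
    data = [None] * n_latches      # item held by each latch
    stall = [False] * n_latches    # stall indication each latch shows upstream
    next_item = 0                  # items are numbered 0, 1, 2, ...
    trace = []
    for t, stall_in in enumerate(stall_in_schedule):
        # Only latches whose index parity matches the half-cycle are triggered.
        for i in range(t % 2, n_latches, 2):
            downstream = stall_in if i == n_latches - 1 else stall[i + 1]
            stall[i] = downstream  # stall moves back one latch per half-cycle
            if downstream:
                continue           # trigger gated: latch i holds its item
            if i == 0:
                data[0] = next_item
                next_item += 1
            else:
                data[i] = data[i - 1]
        trace.append((list(data), list(stall)))
    return trace


# Free running for 8 half-cycles, stalled for 8, then released for 8.
schedule = [False] * 8 + [True] * 8 + [False] * 8
trace = simulate(4, schedule)

# Distinct items observed in the last latch, in order of appearance.
seen = []
for data, _ in trace:
    item = data[-1]
    if item is not None and (not seen or seen[-1] != item):
        seen.append(item)
print(trace[15][0], seen)
```

In this model the stall indication reaches one additional latch per half-cycle, the four latches end up holding four distinct items while fully stalled (instead of two distinct items plus two stale copies during free flow), and the items leave the last latch in order with none missing once the stall is released.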

The flowcharts of FIGS. 2D and 2E show abstracted behaviors of two different types of storage devices. The flowchart of FIG. 2D illustrates how data propagates through a pipeline where the storage device of a stage contains two storage nodes (as in a non-split latch pipeline) and the valid and stall indications are only visible at the interface of the storage device, i.e., the behavior is described as a method not dependent on specific implementation details. The flowchart of FIG. 2E illustrates how data propagates through a pipeline where the storage device of a stage contains only one storage node (as in a split latch pipeline). Again the behavior is described as a method not dependent on specific implementation details.

Returning to FIG. 2D, in initialization step 50 the larger circuit (e.g., register, circuit, chip, system, etc.) in which a stage resides is initialized with data indicated as not valid and no stall is indicated. Note that the steps associated with indications that data is valid or not are optional and hence marked with dotted lines, e.g., 52. In a pipeline where no valid indications exist, the dotted arrows are ignored and dotted boxes are replaced by a solid line. Pipelines only indicating stalls implement a progressively stalled behavior. Pipelines that indicate both valid and stalls implement an interlocked behavior.

After initialization, the pipeline stage under consideration, called the current stage, waits for a triggering event in step 51. When the triggering event arrives the arriving data is checked to see if it is valid or not in step 52. If the data is not valid, no action is performed and we return to step 51. If the data is valid it is stored in an output storage node of the stage in step 54 and the output is indicated as valid in concurrent step 55. When the next triggering event arrives in step 56 the new arriving data is checked for validity in step 57. If it is not valid and the adjacent downstream stage is not stalled in step 58, the current stage will become empty and is indicated as not valid in step 59 and we return to step 51. If the new arriving data is not valid and the downstream stage is stalled in step 58, then the data in the output node of the current stage must be held so we return to step 56. However, since storage space is still available in the internal node of the current stage there is no need to indicate a stall yet.

If the arriving data is valid in step 57 and the downstream stage is not stalled in step 60, the pipeline operates as normal and stores the arriving data in the output node in step 61 and we return to step 56. If the downstream stage is stalled in step 60, however, then the arriving data is stored in the internal storage node of the current stage and the stage needs to stall as in step 62, as there are no more empty storage nodes available to receive additional data. Once stalled, the current stage waits by looping through steps 63, 64 until the downstream stage is no longer indicated as stalled in step 64, at which time the current stage moves the data currently in the internal node to the output node, and indicates that it is no longer stalled as in step 65 and we return to step 56.
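The FIG. 2D walk-through above can be condensed into a small state machine. This is a sketch written from the prose description only; the class name `TwoNodeStage`, its attribute names, and the single `trigger` method are assumptions made for illustration, and step numbers in the comments refer to the steps named in the text.

```python
class TwoNodeStage:
    """Behavioral sketch of an interlocked stage with two storage nodes."""

    def __init__(self):            # step 50: not valid, not stalled
        self.output = None         # output storage node
        self.internal = None       # internal storage node
        self.valid_out = False     # valid indication to downstream
        self.stall_out = False     # stall indication to upstream

    def trigger(self, data_in, valid_in, downstream_stall):
        """Handle one stage triggering event."""
        if self.stall_out:                      # steps 63, 64: stalled
            if not downstream_stall:            # step 65: unstall
                self.output, self.internal = self.internal, None
                self.stall_out = False
        elif not self.valid_out:                # steps 51, 52: stage empty
            if valid_in:                        # steps 54, 55
                self.output, self.valid_out = data_in, True
        elif valid_in and downstream_stall:     # step 62: both nodes full
            self.internal, self.stall_out = data_in, True
        elif valid_in:                          # step 61: normal operation
            self.output = data_in
        elif not downstream_stall:              # step 59: stage becomes empty
            self.output, self.valid_out = None, False
        # Otherwise (no valid input, downstream stalled): hold the output
        # node; the internal node is still free, so no stall is indicated.


stage = TwoNodeStage()
stage.trigger("A", True, False)   # A goes to the output node
stage.trigger("B", True, True)    # downstream stalled: B goes to internal node
print(stage.output, stage.internal, stage.stall_out)
```

After the second event the stage holds two items, A in the output node and B in the internal node, and only then does it indicate a stall to its upstream neighbor.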

Now consider the operation of a two phase, or similar type, pipeline as described by FIG. 2E. Again, lines and boxes associated with valid indications are dotted and are ignored in pipelines without valid indications. After initialization in step 70, the pipeline stage in question, called the current stage, waits for an odd numbered triggering event in step 71. When the triggering event arrives, the arriving data is checked to see if it is valid or not in step 72. If the data is not valid, the output node, if not already, is indicated as not valid in step 73 and we return to step 71. If the data is valid in step 72, it is stored in the output storage node of the stage in step 74 and the output is indicated as valid in concurrent step 75. At the next even numbered triggering event in step 76, only the stall status of the stage is updated; the data storage device does not store new data. If the downstream stage indicates a stall in step 77 after the even numbered triggering event arrives in step 76, the current stage stalls as in step 78 and waits for the next even numbered trigger event in step 76. If the downstream stage is not indicated as stalled in step 77, then the stage is unstalled (if it was stalled) as in step 79 and we return to step 71 to wait for the next odd numbered triggering event.
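The two phase flow of FIG. 2E may likewise be sketched in a few lines. This Python model is a simplified illustration under stated assumptions, not the disclosed implementation: the class `TwoPhaseStage` and the methods `odd_event` and `even_event` are hypothetical names, with one method per kind of triggering event.

```python
from typing import Optional


class TwoPhaseStage:
    """Hypothetical model of a two phase stage with one storage node (FIG. 2E)."""

    def __init__(self) -> None:
        self.output: Optional[str] = None  # output storage node
        self.valid = False                 # output node holds valid data
        self.stalled = False               # stall indication of this stage

    def odd_event(self, data: Optional[str], data_valid: bool) -> None:
        """Odd numbered triggering event: data may be stored (steps 71-75)."""
        if self.stalled:
            return  # a stalled stage only waits for even events (steps 76-78)
        if data_valid:
            self.output, self.valid = data, True
        else:
            self.valid = False             # step 73

    def even_event(self, downstream_stalled: bool) -> None:
        """Even numbered triggering event: only the stall status is updated."""
        self.stalled = downstream_stalled  # step 78 (stall) or step 79 (unstall)
```

Because data is stored only on odd events and the stall status changes only on even events, the stall indication seen by an odd event is always one sampled at the preceding even event.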

Note that for precharged stages, such as precharged domino logic, step 74 includes first evaluating the stage before storing the evaluated data in the output node, and step 79 includes the precharging of the stage. A precharged stage normally evaluates and precharges on alternate triggering events (odd vs. even). When a precharged stage is stalled, the data is held in the output storage node of the precharged logic. No precharging and no evaluation takes place in a stage while the stage is stalled.

In the described fashion the extra storage space found in, for example, many modern synchronous pipelines can be used advantageously by the present invention to allow a cost-effective progressive, stage by stage, “cycle by cycle”, stalling by allowing the stall indication to upstream stages to be delayed while filling in the empty storage nodes with arriving data.

There are several fundamental differences between the stalling of a pipeline stage in the present invention and the stalling of a stage in a prior art asynchronous pipeline, some of which are outlined below. First, the techniques of the present invention allow stalling of pipelines where the stages operate in lock step, or similar, fashion as opposed to asynchronously. Second, the present invention enables the effective storage capacity in stalled portions of the pipeline to be doubled, or more, by realizing the possibility of storing multiple data items in stages with multiple storage nodes, such as pulsed master-slave latches, flip-flops, and sense-amplifier latches. Third, the present invention makes use of a nacking, rather than acking, stall protocol. Nacking protocols cannot be used as the sole means of operating an asynchronous pipeline, but do work fine in, for example, any pipeline where a triggering event can be used to sample the value of the stall signal as, for example, in a synchronous pipeline.

The following sections of the preferred embodiment will describe the methods and techniques of progressively stalled and interlocked pipelines in more detail. As outlined above, the pipeline does not necessarily have to be synchronous, but may in fact be locally clocked or asynchronous instead. However, to facilitate understanding of the present invention, the below described detailed examples are presented in the context of synchronous pipelines for example only. Applications of the present invention to locally clocked and asynchronous pipelines are simple variations of the methods and techniques presented herein with reference to the below described detailed examples, and are readily apparent to one skilled in the art. Similarly, application of the ESP and ISP techniques of the present invention to different types of storage devices (e.g., latches) and different implementations of the local trigger event logic (e.g., clock splitter logic) is readily apparent to one skilled in the art.

Thus, according to a preferred embodiment of the present invention with regard to synchronous pipelines, clock gating is provided at the individual pipeline stage (or individual latch macro) level. In a first preferred embodiment, an Elastic Synchronous Pipeline (ESP) pauses/stalls stages (i.e., gates off each stage's clock to stall that stage) in reverse synchronous order from a detected stall condition. A second preferred embodiment, an Interlocked Synchronous Pipeline (ISP), is an enhancement of ESP that employs a valid data signal for optimal local clock gating that is based on both data valid and stall conditions. The present invention avoids the aforementioned progressive stalling problems by allowing data to be stored in both master and slave latches/stages during stall conditions, thereby doubling the effective pipeline storage capacity.

Above described state of the art approaches to clock gating provide far from optimal power savings. Clock gating has traditionally been performed at the coarse-grained unit level based on unit inactivity. Only recently have pipeline clocks been gated at the more fine grained, pipeline stage level based on data validity. The inventors have discovered that clock gating based on stall conditions not only provides considerable clock related power savings, but also improves data path delay, power, and area by removing the need for data hold multiplexors. It is estimated that the present invention may save twice as much clock power as prior art approaches by gating at the fine grained pipeline stage level.

FIG. 3A shows a representative example of a typical pair of series connected register stages 100 and 102 (each representing multiple individual latches in a particular stage, e.g., master or slave) of a first preferred embodiment Elastic Synchronous Pipeline (ESP), clocked by global clock (clk) 104. A register-enable signal input 106, 108 to each register controls whether the particular register 100, 102, respectively, switches at its respective clock edge or maintains its current data (pauses/stalls) because the pipeline is stalled at a downstream stage. Enable input 106 is also a stall indication output from stage 102. Stage 100 also includes a stall indication output 110. Stall indication outputs 106, 110 are latched outputs that indicate to the adjacent upstream stage that the respective stage 102, 100 is paused/stalled. In this example, the register stages 100, 102 become opaque (latched) and transparent (passing its input 112, 114 to its output 114, 116, respectively) on opposite edges of the clock, i.e., at falling and rising clock edges. In this example, when both register enable inputs 106, 108 enable the clock, the corresponding register stages 100, 102 sequentially store data, i.e., become opaque at falling and rising clock edges, respectively. Data can be trapped in either/both register stages 100, 102 by dropping register enable inputs 106, 108, thereby holding respective registers 100, 102 opaque.

Synchronous pipelines traditionally prevent data races between latches by alternating the transparency and opaqueness of latches in adjacent register stages. The traditional approach to this technique is based on level sensitive transparent latches where a two phase clock is used such that only every other pipeline stage is active at a time; the latches in inactive stages are opaque and act as barriers preventing data races between the transparent latches of active stages. Similarly, in a pipeline where a master slave latch represents a pair of stages, the master and slave latches alternate between transparent and opaque modes such that there is never a combinational path between two master latches or two slave latches.

These split latch and non-split latch approaches are notably similar. The only fundamental difference is that the split latch pipeline has combinational logic between each array of latches (or pipeline stage), while the non-split latch pipeline only has combinational logic between master/slave latch stage pairs. Another approach to prevent data races is to add delays to the short paths between latches. This approach allows the use of pulsed latches to save clock power. In both approaches transparent stages contain data and opaque stages contain what is referred to herein as bubbles. Although described hereinbelow in the context of two phase split latch pipelines with level sensitive transparent latches, it is understood that the present invention has application to many other types of synchronous pipelines.

Elastic Synchronous Pipeline (ESP)

FIG. 3B shows a flow diagram showing how data passing through synchronous pipeline register stages 100, 102 of FIG. 3A may be paused/stalled upon detection of a stall condition. Under normal operating conditions, the data latches for an active stage are transparent. When an active stage receives a stall signal from immediately following logic or from a downstream stage, the data latch goes opaque on the next clock edge and remains opaque until the stall condition goes away. According to this first preferred embodiment of the present invention, the data latches are held opaque by gating the local clock with the stall signal. The stall signal in turn is propagated backward in the pipeline and is kept synchronized to the pipeline by latching it at each pipeline stage. The stall signal thus propagates only one stage per clock edge, and is thereby kept local to each stage.

So, in initialization step 120 the larger circuit (e.g., register, circuit, chip, system, etc.) in which the register stages 100, 102 reside is initialized with the global clock 104 low. Since the pipeline is initially empty, the registers 100, 102 operate substantially identically to other state of the art registers with each subsequent arrival of a respective clock edge in steps 122 and 124. In steps 126, 128, respectively, stall outputs 110, 106 are low indicating that no stall has yet been detected and data is passed through the particular stage 100, 102 in steps 130, 132. Coincidentally, in steps 134, 136, the register enable signal (stall signal) is propagated back through the respective stage 102, 100 as an input to the adjacent upstream stage.

When a stall occurs at a downstream stage, the stall signal propagates back stage by stage, clock edge by clock edge, until it reaches register enable input 108 of register 102. Likewise, if a stall occurs in the stage immediately following stage 102, the stall signal is provided to register enable input 108. Since neither stall indication is high, in step 126 output 110 is checked and in step 130 upstream data is latched into latch 100. Simultaneously in step 134, the stall signal is passed through from enable input 108 to stall signal indication output 106. At the next clock edge in step 124, stall signal indication output 106 is checked in step 128, where a stall condition now has been detected. So, only the enable input (stall signal indication output 106) to latch 100 is passed in step 138 to reflect the stall at stall signal indication output 110. Data in both registers 100, 102 remains unchanged. Thus, when the next clock edge arrives at step 122, the stall signal indication output 110 is high; and, only the stall signal state at enable input 108 is passed to stall signal indication output 106. Again, stages 100, 102 are paused/stalled, storing any data contained therein.

Eventually, the stall condition ends and the stall indication signal at register enable input 108 switches its state to indicate that change. At the first clock edge in step 122 after the state switch, in step 126 stall signal indication output 110 is unchanged and so, the switched stall signal is passed through from register enable input 108 to stall signal indication output 106. Since the stall condition has ended, the data that had been held in stage 102 is stale with the results of that data already latched in the adjacent downstream pipeline stage. So at the arrival of the next clock edge in step 124, the check of stall signal indication output 106 indicates that stage 102 is no longer paused and in step 132 the data in stage 100 is passed through stage 102. Simultaneously, the switched stall signal is passed through from enable input 106 to stall signal indication output 110. Thereafter, the stages 100, 102 operate normally until the next stall condition is detected or propagates back from a downstream stage.

Accordingly, any two data items can be sequentially paused/stalled (stored) in a pair of adjacent, synchronously clocked stages, leveraging the elastic nature of preferred embodiment pipelines. Further, this sequential storing of data through clock gating (i.e., at 106, 110) uses backward interlocking in a synchronous pipeline for stage level handshaking. Each stage generates a stall signal to its upstream neighbor that indicates when the stage is ready/not ready to receive new data.

FIG. 4A shows an example of a four stage, two phase split latch pipeline 140 with individual stall latches 142, 144, 146, 148 at each stage, propagating the stall signal backward in the pipeline and FIG. 4B is a corresponding timing diagram. The stall latches 142, 144, 146, 148 are clocked by global clock 150 (gclk) on the opposite clock edge as that of their associated corresponding data register stages 152, 154, 156, 158. The global clock 150 to each of the register stages 152, 154, 156, 158 is gated in gates 142 g, 144 g, 146 g, 148 g by the output of an associated stall latch 142, 144, 146, 148, respectively. Also in this example, delay gates that may be included to remove the skew between the data and stall latch clocks are not shown.

The timing diagram example of FIG. 4B and the corresponding sub-trace of FIG. 4C illustrate the relationship between global and local data latch clock states along with the stall and data signals for each stage. In this example, each data item progressing through the pipeline is represented by an alphabetic character. In the timing diagram example of FIG. 4B half levels, e.g., 160, of data traces indicate that the corresponding stages are transparent. Opaque stages are represented by blocks between half levels with a character representing the corresponding data item currently stored in that stage. The portion between dotted lines 162, 164 corresponds in more detail to the sub trace entries of FIG. 4C with data stream A, B, C, D, E being applied to the pipeline. Box 166 highlights how the stall condition propagates backward through the pipeline. Clock periods 168, 170, 172, 174 and 176 between 162, 164, each contain a high and a low phase, which are indicated individually in FIG. 4C by an appropriate period designation followed by a phase designation, e.g., 170 l or 170 h. Data (or a stall signal) stored in an opaque latch (whether gated in normally or held during a pause) is indicated by boldface characters. Data passing through a transparent latch is indicated by non-bold characters.

So, in phase 168 h the pipeline is in steady state operation with two data items continuously present in this portion of the pipeline. Data register stages 152, 156 are opaque, storing data items B and A respectively. Coincidentally, data register stages 154, 158 are transparent and do not store any data. Stall latches 144, 148 are opaque and stall latches 142, 146 are transparent. Once the next falling clock edge arrives to start 168 l, data register stages 154, 158 become opaque, storing data items B and A, while the stall latches 144, 148 become transparent. During 168 l the stall signal (stall) is asserted, passing through transparent stall latch 148 to the clock gate 148 g at data stage 158, which pauses (stalls) data stage 158. Stall latch 146 is opaque. Stalled data stage 158 continues to store data item A after the next rising clock edge arrives to start 170 h. In clock phase 170 h, data items C and B are latched into opaque registers 152, 156, respectively. Simultaneously, the asserted stall signal propagates through currently transparent stall latch 146 to the clock gate 146 g. This disables the clock to opaque data stage 156, which contains B.

When the next falling clock edge arrives to start clock phase 170 l, both stages 156 and 158 are stalled, opaque and storing data items B and A, respectively. Data stage 154 in turn becomes opaque storing data item C and, the asserted stall signal passes through transparent stall latch 144 to clock gate 144 g. Transparent data stage 152 passes data item D. The next rising clock edge arrives starting clock phase 172 h with data stages 154, 156 and 158 stalled and storing data items C, B, and A, respectively. At this time, opaque data stage 152 stores data item D and the asserted stall signal propagates through stall latch 142 to clock gate 142 g. Upon arrival of the next clock edge to start clock phase 172 l, all four sections of the pipeline have been safely stalled without losing any data items. All stages in the pipeline are filled with valid data items A, B, C and D.

As can be seen from this example, a stall condition can be considered a sliding window (e.g., 166) moving backward through the pipeline. Outside the stall condition window 166, data is stored normally in every other pipeline stage as is typical for a two phase split latch pipeline. Since all of the latches within the stall condition window 166 are opaque, data is stored in every paused pipeline stage. Thus, preferred embodiment pipelines may be considered elastic due to this adaptive storage capacity.

Unstalling is similar to stalling the pipeline. Essentially, the pipeline data stages 152, 154, 156, 158 are enabled one stage at a time in the same order that they were stalled. This recreates the pipeline bubbles without losing data when data starts moving through the pipeline again. So, in clock phase 172 l, all stages remain stalled and stall latches 144, 148 are transparent. During this phase, the stall signal is deasserted indicating that the condition that caused the stall no longer exists, i.e., stage 158 no longer need be stalled. The deasserted stall signal propagates through the transparent stall latch 148 to clock gate 148 g, enabling the clock to data stage 158 such that stage 158 is no longer stalled. Since stage 158 is no longer stalled, the data stage becomes transparent at the next rising clock edge arrival, i.e., at the beginning of clock phase 174 h. Stages 152, 154 and 156 remain stalled, storing data items D, C, and B. The deasserted stall signal passes through currently transparent stall latch 146 to clock gate 146 g, enabling the clock to stage 156.

When the next clock edge arrives to start clock phase 174 l, data item B is latched in opaque data stage 158 and data item C is passed through transparent data stage 156 with stages 152 and 154 remaining stalled and storing data items D and C. The deasserted stall signal passes through transparent stall latch 144 to clock gate 144 g, enabling stage 154. At the next clock edge arrival to begin clock phase 176 h, data item C is stored in stage 156 and stages 154 and 158 are transparent. The deasserted stall signal passes through transparent stall latch 142 to clock gate 142 g, enabling the clock to stage 152. Thus, at the arrival of the next clock edge to start clock phase 176 l and end the stall window 166, data items D and C are stored in opaque stages 154 and 158, respectively. Transparent stages 152 and 156 are passing data and the pipeline returns to normal steady state operation.

Thus, heretofore unrealized, a two phase pipeline can be stalled progressively as described above because filling bubbles normally present in the pipeline with data items masks the “delay” of propagating the stall signal backward in the pipeline one stage at a time. With N stages (N=4 in the example of FIGS. 4A-C) in a pipeline, normally no more than N/2 data items (2) are present in the pipeline at steady state, while bubbles occupy the remaining N/2 stages. The present invention uses these N/2 bubbles as data buffers during a stall. The stall signal propagates back two stages each clock period and so, takes N clock edges (and N clock phases or N/2 clock periods) to propagate back to the start of the pipeline. During these N phases, new data items continue to enter the pipeline (in a two phase pipeline new data enters the pipeline only every other clock edge). Normally, there is enough (normally unused) buffer storage such that all data can be stored safely. Thus, when all stages have stalled, the pipeline has an occupancy potential of N data items. Likewise, when unstalling the pipeline, the delay introduced by propagating the stall signal backward one stage at a time recreates the pipeline bubbles such that data safely propagates through the pipeline again. With the whole pipeline unstalled, the occupancy potential of the pipeline returns to N/2 data items.
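The stall and unstall sequence of FIGS. 4A-C, and the change in occupancy from N/2 to N data items, may be reproduced with a phase-level behavioral sketch. The following Python model is an illustration under stated assumptions rather than a description of the circuit: the function `simulate_esp`, its parameters and the phase numbering (even numbers for high phases, odd numbers for low phases) are introduced here, the input is assumed to offer the next data item whenever the first stage closes, and stage indices 0 to 3 stand for data stages 152, 154, 156, 158 and their stall latches 142, 144, 146, 148.

```python
from collections import deque


def simulate_esp(items, stall_phases, n_stages=4, n_phases=12):
    """Return, per clock phase, the item held by each opaque stage (None = transparent)."""
    source = deque(items)
    held = [None] * n_stages      # item held while a stage is opaque
    opaque = [False] * n_stages
    stall = [False] * n_stages    # stall latch outputs
    trace = []
    for phase in range(n_phases):
        high = phase % 2 == 0
        prev_held, prev_opaque = held[:], opaque[:]
        for i in range(n_stages):
            if (i % 2 == 0) == high:
                # Stage closes in this phase. If it was transparent, it
                # captures the item held by its upstream neighbor.
                if not prev_opaque[i]:
                    if i == 0:
                        held[i] = source.popleft() if source else None
                    else:
                        held[i] = prev_held[i - 1]
                opaque[i] = True
            else:
                # Stage would open in this phase; the latched stall signal
                # gates its clock and holds it opaque instead.
                opaque[i] = stall[i]
                if not stall[i]:
                    held[i] = None
        # Stall latches are transparent while their data stage is closed,
        # so the stall moves back one stage per clock edge.
        new_stall = stall[:]
        for i in range(n_stages):
            if (i % 2 == 0) == high:
                new_stall[i] = (phase in stall_phases) if i == n_stages - 1 else stall[i + 1]
        stall = new_stall
        trace.append(tuple(held))
    return trace


trace = simulate_esp("ABCDE", stall_phases={3, 4, 5, 6})
for phase, row in enumerate(trace):
    print(phase, row)
```

With the stall asserted during phases 3 through 6, phase 2 of this sketch corresponds to 168 h, phase 7 to 172 l (all four stages holding D, C, B and A) and phase 11 to 176 l (D and C held in the second and fourth stages), so the occupancy rises from N/2 = 2 items to N = 4 items and returns to 2.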

Interlocked Synchronous Pipeline (ISP)

The second preferred, ISP embodiment augments the ESP embodiment using valid data signals at each stage to identify holes (absence of a valid data item) in the pipeline where it is unnecessary to pause the pipeline, thus improving throughput. FIG. 5 shows an example of a four stage synchronous pipeline 180, wherein each register stage 182, 184, 186, 188, includes a valid bit latch 190, 192, 194, 196 as is known in the art. As data enters the pipeline 180, it is accompanied by a 1 bit valid data signal or valid data bit that propagates alongside the valid data item in synchronous lock step. In this example, each valid data bit gates the clock to the corresponding stage, blocking the clock when valid data is not present in a particular pipeline stage. Thus in this example, A, B and C indicate valid data in stages 182, 186 and 188, each of which is accompanied by a “1” indicating a valid data item. The hash mark “#” in stage 184 indicates the absence of valid data and is accompanied by a “0” in valid data latch 194. In the ISP embodiment, the ESP decision to stall an upstream stage is modified by determining from the valid bit whether that stage contains valid data and so, should be stalled. Such an ISP embodiment improves pipeline throughput by filling holes and further reduces clock power because the local stage clock is gated by both the valid data signal and the stall signal. Only valid data propagates through the pipeline and power is consumed only in stages with valid data.

So, according to the ISP embodiment, during a stall condition, each valid data latch for each stage must be clock gated together with the data latches to correctly propagate or stall each valid data bit along with its associated data item. Since a stall condition only need propagate backward when the upstream stage contains valid data, a valid data signal or bit that propagates with each valid data item indicates whether the particular stage contains a valid data item that may be lost and, therefore, that must be paused/stalled upon a stall. Thus, when a stage has the valid data bit asserted, the stage may be stalled as described above for ESP; when a stage does not have the valid data bit asserted, its absence overrides the stall bit, effectively stalling the stall bit, until valid data reaches that stage. Including the valid data bit in deciding whether to stall individual stages improves pipeline latency and throughput in the presence of stalls, because data in upstream stages can continue through the pipeline until all holes have been filled. Thus, unless the pipeline completely fills with valid data items, the stall may be transparent to other upstream units external to the ISP.
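The rule that a missing valid data bit overrides the stall bit can be illustrated with a short sketch. The Python function below is a simplified snapshot under stated assumptions: `stalled_stages` is a hypothetical name, the valid bits are listed from the most upstream stage to the most downstream stage, and the one-stage-per-clock-edge latching of the stall signal and the subsequent movement of data are deliberately ignored.

```python
from typing import List


def stalled_stages(valid_bits: List[bool], external_stall: bool = True) -> List[bool]:
    """Return which stages stall for one snapshot of the valid data bits."""
    stalled = [False] * len(valid_bits)
    stall = external_stall
    # Walk backward from the most downstream stage. The stall reaches a
    # stage only if every stage after it was stalled and held valid data.
    for i in reversed(range(len(valid_bits))):
        stall = stall and valid_bits[i]
        stalled[i] = stall
    return stalled


# Snapshot resembling FIG. 5: A, hole, B, C in stages 182, 184, 186, 188.
print(stalled_stages([True, False, True, True]))
```

For this snapshot only the two most downstream stages stall; the hole stops the stall bit, so the item in the first stage is free to advance and fill the hole before it is stalled in turn.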

Further, by interlocking pipelines, control of whether a data item continues through the pipeline can be decided locally. Therefore, decisions such as whether to clock gate or not clock gate and, whether to pause or restart a pipeline stage can be made independent of other pipeline stages. The ability to perform such local decisions is achieved through handshake interlocking in both directions, forward as well as backward. Handshake signals indicate to neighboring stages whether there is data available and, whether a stage is ready to receive new data or not. Since these are local handshake signals that affect a relatively few latches, clock gating through interlocking techniques can be applied even to very high frequency pipelines.

FIG. 6A shows a representative example of a typical pair of series connected synchronous pipeline stages 200 and 202 (each representing multiple individual latches in a particular stage) illustrating the ISP preferred embodiment and substantially similar to the register stages 100, 102 of FIG. 3A. Each stage 200, 202 latches data responsive to a synchronous clock (clk) 204. A stage enable input 206, 208 for a register control signal to each stage, in part controls whether the particular stage 200, 202, switches at its respective clock edge or maintains its current data contents (pauses/stalls) because of a stall condition. Enable input 206 is also a stall indication output from stage 202. Stage 200 includes a stall indication output 210. Stall indication outputs 206, 210 indicate to the adjacent upstream stage that the current stage 202, 200 is paused/stalled. In this example, the register stages 200, 202 become opaque (latched) and transparent (passing its input 212, 214 to its output 214, 216, respectively) on opposite edges of the clock, i.e., at falling and rising clock edges. Each stage 200, 202, also includes a valid data input 218, 220 that indicates that corresponding incoming data 212, 214 is valid; and a valid data output 220, 222 that indicates that the respective stage's output 214, 216 is providing valid data.

In this example, only when both the respective stage enable output 206, 210 indicates the absence of a stall and the corresponding incoming data valid bit 218, 220 indicates that incoming data is valid, is the clock 204 enabled for that register stage. When the clock 204 is enabled for both register stages, the register stages 200, 202 sequentially store data, i.e., become opaque at falling and rising clock edges, respectively. Valid data can be trapped in either/both register stages 200, 202 by dropping register control signals to stage inputs 206, 208, thereby holding respective registers 200, 202 opaque.
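The clock enable condition just described amounts to a two-input decision per stage. The following Python sketch tabulates it; the function name `data_clock_enabled` and its parameter names are assumptions made for illustration, and signal polarities in an actual circuit may differ.

```python
def data_clock_enabled(stall_out: bool, valid_in: bool) -> bool:
    """Enable the data latch clock only with no stall and valid incoming data."""
    return (not stall_out) and valid_in


# Tabulate all four combinations of stall indication and incoming valid bit.
for stall_out in (False, True):
    for valid_in in (False, True):
        enabled = data_clock_enabled(stall_out, valid_in)
        print(f"stall={stall_out!s:5} valid_in={valid_in!s:5} clock_enabled={enabled}")
```

Only one of the four combinations clocks the data latches, which is the sense in which the stage clock is gated by both the stall signal and the valid data signal.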

FIG. 6B is a flow diagram showing how data passing through pipeline register stages 200, 202 of FIG. 6A may be paused during a stall. Under normal operating conditions, only valid data is propagated through the pipeline. When a stall occurs, the foremost stages with valid data go opaque on the next clock edge and remain opaque until the stall condition ends. Valid data in earlier stages continues to propagate through the pipeline until it reaches the last unstalled stage, i.e., wherein the adjacent downstream stage is stalled, at which time that stage is paused/stalled. According to this ISP preferred embodiment, the data latches are held opaque by gating the clock with both the stall signal and the corresponding data valid bit. The stall signal in turn propagates backward in the pipeline until it encounters empty stages and is kept in synchronous lock step to the pipeline by latching it at each pipeline stage. The stall signal thus propagates no more than one stage per clock edge, filling holes as it propagates and is thereby kept local to each stage with valid data.

So, in initialization step 230 the larger circuit (e.g., register, circuit, chip, system, etc.) in which the register stages 200, 202 reside is initialized with the global clock 204 low. Since the pipeline is initially empty, the registers 200, 202 operate substantially identically to other state of the art registers, upon each arrival of a respective clock edge in steps 232 and 234. In steps 236, 238, respectively, stall outputs 210, 206 are not asserted because, initially, a stall has not been detected yet. Coincidentally and in parallel, valid data signal outputs 222, 220 are checked in steps 240, 242 to determine whether the stall bit should be propagated. Each respective stall bit is propagated in steps 244, 246, only if the corresponding data valid output 222, 220 is asserted. Thus, the stall output 210, 206 is not asserted in steps 236, 238 if either, a stall condition is not propagating back through the pipeline or, the respective stage 200, 202 does not contain valid data. If a stall output 210, 206 is not asserted in steps 236, 238, then in steps 248, 250, the data valid inputs 218, 220 are evaluated to determine if valid data is being provided to a respective stage 200, 202. If a data valid input 218, 220 indicates that valid data is not available, then in steps 252, 254 only the data valid input bit 218, 220 is passed to data valid outputs 220, 222. Otherwise, when valid data is provided to either/both stage inputs 212, 214, coincidentally data is passed through the particular stage 200, 202 in steps 256, 258 and the data valid input bit 218, 220 is passed to data valid outputs 220, 222 in steps 260, 262.

When a stall occurs in a downstream stage, a stall signal propagates back stage by stage, cycle by cycle, until it reaches clock enable input 208 of latch 202. Likewise, if a stall occurs in the stage immediately following stage 202, the stall signal is provided to clock enable input 208. If neither stage 200, 202 contains valid data, the stall indication continues to be ignored and in steps 252, 254, only the valid data signal state is latched and forwarded in stages 200, 202 until valid data arrives at the second stage 202, i.e., valid_out is asserted. With valid_out asserted, the stall signal begins to propagate back through the stage 202 in step 244. However, when the stall indication first arrives, neither stall indication output 206, 210 is asserted when output 210 is checked in step 236. In step 248 the valid data bit input 218 for upstream data is checked. If valid data is being provided, both the data and the corresponding valid data signal are latched into stage 200 in steps 256, 260; otherwise, only the valid data signal is latched in step 252.

In step 234 at the next clock edge, it is determined when stall signal indication output 206 is checked in step 238, that a stall condition has occurred and stage 202 is paused. Simultaneously, in step 242 the contents of stage 200 are checked and if they are not valid, the stall signal is not propagated back; otherwise if in step 242 stage 200 is found to contain valid data, then in step 246, the stage enable input (stall signal indication output 206) to stage 200 is passed to reflect the stall condition at stall signal indication output 210. Data in both registers 200, 202 remains unchanged. Thus, when the next clock edge arrives at step 232, the stall signal indication output 210 is high; and in step 244, only the stall signal state at stage enable input 208 is passed to stall signal indication output 206. Again, data in both registers 200, 202 remains unchanged and the stages 200, 202 are paused. Any holes that may have existed between the two data items in stages 200, 202 have been eliminated during the selective pause/stall of these two stages. Thus, some of the degraded performance that occurred from stalling the foremost data item in stage 202 may be recovered by subsequent data items.

Eventually, the stall condition ends and the stall indication signal at stage enable input 208 switches state to indicate that change. At the first subsequent clock edge in step 232 the stall signal indication output 210 is unchanged in step 236 and so, stage 200 is unchanged. Again, simultaneously and in parallel, in step 240 it is determined that stage 202 contains valid data and in step 244, the switched stall signal passes through from stage enable input 208 to stall signal indication output 206. Since the stall has ended, the data that had been held in stage 202 is stale; the results of that data have already been latched in the adjacent downstream pipeline register stage. So, at the arrival of the next clock edge in step 234, the check of stall signal indication output 206 indicates that stage 202 is no longer paused and in step 250, incoming data is checked to determine if it is valid. Valid data in stage 200 and its associated data valid signal are passed to stage 202 in steps 258, 262; otherwise, only the valid data signal is passed in step 254. Simultaneously, in step 246 the switched stall signal is passed through from stage enable input 206 to stall signal indication output 210. Thereafter, the stages 200, 202 operate normally until the next stall is detected and propagates back from a downstream stage.

FIG. 7A shows an example of a four stage, two phase synchronous pipeline 270, each stage 272, 274, 276, 278 including an internal stall bit latch 272 s, 274 s, 276 s, 278 s and a valid data bit latch 272 v, 274 v, 276 v, 278 v for forward and backward interlocking and clocked by global clock (gclk) 280. Logic gates, e.g., 286, 284 and 282 at each stage 272, 274, 276, 278 gate global clock 280 to the respective stall latch, valid data bit latch and register data latches. The input to the valid data bit latch 272 v, 274 v, 276 v, 278 v indicates that associated data is valid and should be passed to intervening logic 288, 290, 292 or 294. Each stall latch 272 s, 274 s, 276 s, 278 s is clock gated by the output of an associated valid data latch 272 v, 274 v, 276 v, 278 v. This ensures that holes in the pipeline are filled by preventing the stall from propagating upstream when there is no valid data present.

FIG. 7B is a corresponding timing diagram and FIG. 7C is a detailed sub trace of FIG. 7B between dotted lines 298, 300. As in the example of FIGS. 4B-C, each data item progressing through the pipeline is represented by an alphabetic character. Invalid data (a hole) is represented by a # symbol. Data trace half levels indicate that the corresponding stages are transparent. Opaque stages are represented by blocks between half levels with a character representing the corresponding data item currently stored in that stage. Under normal operating conditions, the data latches for an active stage are transparent and only valid data is propagated through the pipeline. When an active stage generates a stall signal, the data latches with valid data go opaque on the next clock edge and remain opaque until the stall condition goes away. Valid data continues to propagate through the pipeline, filling holes until it reaches a stage wherein the adjacent downstream stage is stalled, at which time that stage is paused/stalled.

Valid data signals propagate forward in the pipeline with valid data. As with the above described ESP embodiment, stall signals propagate in the backward direction of the pipeline. A stall bit indicates when the pipeline must halt, for example, due to access conflicts at a shared resource.

With a typical globally stalled synchronous pipeline, stall control logic fills holes and handles stall signals generated by multiple stages. The control logic introduces delays from long global wires, from additional stall control logic and from stall signal fan out, which grows linearly with the number of stages being driven. These pipeline control delays impact the cycle time in prior art synchronous pipelines. By contrast, in a preferred embodiment interlocked pipeline, the stall control logic is contained locally to each stage and so, only adds a small constant delay. Locally stalled pipelines, therefore, have an advantage of improving slack on stall signals because they are locally latched and originated.

In the sub trace of FIG. 7C, the data stream A, #, B, #, C, D, E is applied to the pipeline 270 of FIG. 7A. Since invalid data is essentially a don't care, it need not normally propagate through the pipeline 270, provided a valid data item following an invalid data position does not arrive at the end of the pipeline too soon. A valid data item arriving too soon would have to be stalled there. Pipeline stalls can act to delay following valid data items such that such a valid data item can only arrive after its desired arrival time, causing delays in other pipelines or units. So, for each stage 272, 274, 276, 278, the accompanying valid data signal gates locally, blocking the clock to the stage 272, 274, 276, 278, whenever the corresponding valid data signal is a zero. As above, bold text in the trace indicates when data (or valid/stall) is stored in a corresponding stage, i.e., the stage is opaque. Non-bold text indicates that data (or valid/stall) is passing through the stage, i.e., the stage is transparent. Polygons 302, 304 illustrate how the clock gated holes propagate forward in the pipeline. Polygon 306 illustrates how the clock gated stall propagates backward in the pipeline.

When data item A reaches stage 278, a stall is generated for two consecutive clock cycles, illustrated by polygon 306. In an elastic pipeline of the ESP embodiment, the stall condition propagates backward in the pipeline unchanged, stalling each stage, including stages with holes, for two cycles as described hereinabove. In an ISP embodiment, however, when a hole is encountered the valid data bit latch contents override the stall condition by blocking the clock to the stall latch, allowing valid data items to continue until they reach the stalled latch. Thus, the stall window 306 is truncated when it encounters an invalid window 302, 304. The override in turn cancels out the invalid data condition when the hole gets filled with valid data, resulting in the stall window 306 truncating invalid windows 302, 304.

So, in this example the input data stream contains two holes, one after data item A and another after data item B. Thus, according to the ISP embodiment of the present invention, rather than stalling all stages for two cycles, stage 278 stalls for two cycles, while stage 276 stalls only for one cycle, and stages 274 and 272 do not stall at all. The stall condition is shortened by one cycle at stage 276 which, during the first stall cycle, contains an invalid data entry (#) or a hole that follows data item A. The invalid data signal accompanying the invalid data entry overrides the asserted stall signal to fill in the hole in the pipeline at stage 276. Thus, the first cycle of the two cycle long stall window is zeroed out at stage 276 and does not propagate backward in the pipeline. So, rather than being stalled in stage 274 for two cycles, data item B instead propagates to stage 276, filling the hole there and stalling for one cycle only. Similarly, the invalid data entry following data item B propagates to stage 274 such that as the remaining second cycle of the stall window reaches stage 274, the hole there is filled and zeroes out the stall window completely. Due to the holes in the pipeline, the stall condition never reaches past stage 276, much less to the start of the pipeline or before, and the input environment does not need to stall. Therefore, data items C and D do not stall in the data stream but rather, propagate through the pipeline in a normal fashion.
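By way of illustration only, the hole-filling behavior described above can be sketched with a deliberately simplified behavioral model. The model below assumes one data item per stage and lets the stall ripple upstream within a single cycle; it therefore omits the per-stage stall latch delay and the master/slave double storage of the actual embodiment, so its cycle-by-cycle trace is not that of FIG. 7C. The function name `step` and the list representation are hypothetical and are not part of the figures.

```python
from typing import List, Optional, Tuple

Item = Optional[str]  # None models a hole (invalid data, shown as "#")


def step(stages: List[Item], incoming: Item,
         out_stall: bool) -> Tuple[List[Item], bool, Item]:
    """Advance a simplified interlocked pipeline by one cycle.

    stages[0] is the most upstream stage, stages[-1] the most downstream.
    Returns (new_stages, input_accepted, output_item).
    """
    n = len(stages)
    hold = [False] * n
    blocked = out_stall
    # Walk from downstream to upstream. A stage holds only if it contains
    # valid data AND the stage after it is holding; a hole absorbs the stall.
    for i in reversed(range(n)):
        hold[i] = stages[i] is not None and blocked
        blocked = hold[i]

    new_stages = list(stages)
    for i in range(n):
        if not hold[i]:
            # A non-held stage takes whatever sits immediately upstream.
            new_stages[i] = stages[i - 1] if i > 0 else incoming

    output = None if hold[-1] else stages[-1]
    return new_stages, not hold[0], output


# Stream A, #, B, # already in a four stage pipeline; C and D arrive next
# while the environment stalls the output for two cycles.
pipeline = [None, "B", None, "A"]
pipeline, accepted_c, _ = step(pipeline, "C", out_stall=True)
pipeline, accepted_d, _ = step(pipeline, "D", out_stall=True)
```

In this sketch the most downstream stage holds for both stall cycles, the stage behind it holds only for the second cycle, and the two upstream stages never hold, so the input is accepted in both cycles. This mirrors the qualitative result described for stages 278, 276, 274 and 272.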

Although the ESP and ISP embodiments have been described hereinabove with reference to two phase clocked pipelines with split and non-split master/slave registers based on transparent latches, the present invention has equal application to any register structure with two storage nodes or on pulsed latches as are further described hereinbelow.

FIG. 8 shows an example of a stage pair 310, 312 with interlock logic gates 314, 316, 318, 320, 322, 324 for such a two phase clocked non-split latch master/slave based ISP embodiment. The clock 326 is gated at the end of each clock cycle after new data has been received, rather than at the first clock edge as in the two phase, split latch ISP embodiment with early valid. The valid data signal must be in phase with the clock to guard against local clock gating glitches; therefore, the valid data signal 328 from the master latch 310 gates the master latch clock in gate 318, the slave stage clock in gate 320 (after it is gated by the internal stall signal in gate 322), as well as the master stall latch in gate 318.

It is understood that this master/slave pair 310, 312 is for example only and that the present invention may be applied to any suitable glitch-free (hazard free) adaptation of local clock gating to a master/slave pipeline for stage interlocking. Valid data and stall signals must meet standard synchronous timing constraints for clock gating.

In a two phase pipeline with early valid, glitches on the valid data signal are filtered out by the clock gating for each stage, by assuring that the clock is not active during the valid data signal output settling time. During this settling time, the valid data signal stabilizes before the next clock edge arrives at the end of the first half of the clock cycle. However, in a late valid pipeline, the valid data signal is taken after, rather than before, the master latch. Glitches that might occur during the first half of the clock cycle are ignored. During the second half of the clock cycle, clock polarity serves to filter out glitches on the valid data signal. So, in a late valid ISP embodiment, the valid data signal must stabilize before the end of each clock cycle, i.e., before the clock edge starting the next clock period arrives.

The stall signal has the same timing constraints for both split latch and non-split latch ISP embodiments. Glitches are avoided on the stall signal during the first half of the clock cycle because the stall latch is opaque. During the second half of the clock cycle, glitches on the stall signal are filtered out at the clock gating logic by the clock polarity. The stall signal must stabilize before the end of each current clock cycle, i.e., before the clock edge starting the next clock period arrives. Delay gates may be inserted on non-gated local clocks to zero out clock skew that might have been introduced by the gating functions on gated clocks.

Pulsed Latch ISP

FIGS. 9A-B show an example of a pulsed master/slave pair 340, 342 and clock logic gating functions therefor, which may be used in a pulsed mode ISP embodiment for further reduced clock power over a two phase master/slave ISP embodiment. The master/slave pair 340, 342 has two operation modes, a normal two phase clocked master/slave operation mode and a pulsed operation mode. Normally, in pulsed mode, the master 340 remains transparent (master clock is continually hot) and the clock is pulsed to the slave 342. Since the master and slave 340, 342 form a basic two stage latch structure, they can still store two data items, one in each stage. The valid data and stall latches run in normal two phase clocked master/slave mode operation. The same clock gating logic 314, 316, 318, 320, 322, 324 shown in the example of FIG. 8 may be used to control the clock to the valid data and stall latches. An extra stall″ latch is included to avoid turning on the data latch clock (c1_data) early when the stall condition ends, which could happen while the slave data stage is still being pulsed. The stall″ latch 344 is clocked by an ungated global clock (gclk) with necessary skew delay adjustment.

Thus, when a stall condition is asserted, the clock logic for the master latch reverts back from hot mode to two phase clocked mode. The stall input to AND gate 346 disables the clock pulse to the slave 342, pausing it and holding the old data value; and, the clock to the master 340 is first enabled by the stall signal input to OR gate 348 and then, disabled by stall′ at OR gate 350 to make master 340 opaque also, storing the upstream data item. Thus, two data items are paused, one in the master 340 and one in the slave 342. When the stall condition is deasserted, the clock to the slave 342 is enabled again and propagates the second data item to the environment. On the next clock edge, the master is made transparent, and the pair again runs in pulsed mode with the clock logic configured for pulsed operation.

Due to the asymmetric nature of pulsed master-slave pipelines, when at the end of a clock cycle the slave latch holds data that must be stalled and new data arriving at the master latch must also be stalled, then master and slave stall simultaneously. The slave latch stalls already stored data and the master latch both stores and stalls arriving data simultaneously. Note that a pulsed master-slave stage as described above can also operate as a pair of split latch stages.

ISP Primitives

The above preferred embodiments have been described with reference to a simple linear pipeline structure. However, a typical pipeline register, circuit, chip, system, etc. may have a much more complex path that can be viewed as a collection of data flow primitives that steer data to desired locations of the system. These primitives include pipeline forks, joins, branches, and select structures that can be used to build complex pipeline systems. The present invention has application to pipelines including such primitives, especially in synchronous interlocked pipeline structures.

FIG. 10 shows an example of an application of the ISP embodiment of the present invention to a 1 to 2 fork stage 370. Generally, a pipeline fork stage is a 1 to N path split, where a data item from an upstream stage flows into all N parallel downstream pipeline stages. A fork stage must stall if any downstream stage in any of its N paths stalls. When a fork stage is stalled, non-stalled downstream stages must be prevented from receiving duplicate copies of the data as valid from the stalled fork stage. Thus, the simplest way this can be accomplished is through a synchronized, or aligned, fork stage where the valid data signals to all downstream stages are zeroed out (indicating invalid data is being provided) until all downstream stall conditions have ended. Thus, once the stall abates, all downstream stages simultaneously receive the newly unstalled data. The valid data and stall signal logic for a 1 to N synchronized fork stage must satisfy:

stall = (stall[1] OR . . . OR stall[N])
valid[i] = valid AND NOT (stall[1] OR . . . OR stall[N])

Alternatively, the fork stage can be implemented as a non-synchronized, or nonaligned, fork with the valid and stall logic implemented as a state machine to keep track of whether data has already been copied to a downstream stage or not. In this alternate embodiment, data is copied to downstream stages on an individual basis as they become non-stalled, giving the computation in non-stalled downstream pipelines an early start.
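For illustration, the synchronized fork relations given above can be expressed as a small combinational function. This is a minimal sketch only; the name `fork_signals` and the use of Python booleans and zero-based lists are assumptions made here for readability and do not correspond to any element of FIG. 10.

```python
from typing import List, Tuple


def fork_signals(valid: bool,
                 downstream_stalls: List[bool]) -> Tuple[bool, List[bool]]:
    """Combinational logic of a 1 to N synchronized (aligned) fork stage.

    valid: valid data signal held by the fork stage.
    downstream_stalls: stall[1..N] from the N downstream stages
                       (stored zero-based in the list).
    Returns (stall, valid_out), where stall goes to the fork stage and
    valid_out[i] goes to downstream stage i.
    """
    any_stall = any(downstream_stalls)
    # The fork stalls if any of its N downstream paths stalls.
    stall = any_stall
    # Valid is zeroed toward every path until all stalls have ended, so no
    # non-stalled path receives a duplicate copy of the held data.
    valid_out = [valid and not any_stall for _ in downstream_stalls]
    return stall, valid_out
```

With `valid=True` and one of two downstream paths stalled, the sketch stalls the fork and presents invalid data to both paths; once neither path is stalled, both receive valid data in the same cycle.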

FIG. 11 shows an example of an application of the ISP embodiment of the present invention to a 1 to 1-of-2 branch stage 380. Generally, a pipeline branch stage is a 1 to 1-of-N selector that propagates data from an upstream stage to one of N parallel downstream stages. Selection of the downstream stage is determined by the data path logic that generates a set of N one-hot encoded enabling signals. The enable signals mask the branch stage valid data signal through a set of AND functions such that the valid data signal propagates only to the selected downstream stage. The branch stage stalls only if the selected downstream stage is already stalled. The valid data and stall signal logic for a 1 to 1-of-N branch stage must satisfy:

stall = (enable[1] AND stall[1]) OR . . . OR (enable[N] AND stall[N])
valid[i] = valid AND enable[i]
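The branch relations above may likewise be sketched as a combinational function. The name `branch_signals`, the zero-based lists and the explicit check on the enable encoding are illustrative assumptions, not features shown in FIG. 11.

```python
from typing import List, Tuple


def branch_signals(valid: bool, enable: List[bool],
                   downstream_stalls: List[bool]) -> Tuple[bool, List[bool]]:
    """Combinational logic of a 1 to 1-of-N branch stage.

    enable: one-hot enabling signals produced by the data path logic.
    downstream_stalls: stall[1..N] from the downstream stages (zero-based).
    Returns (stall, valid_out).
    """
    if len(enable) != len(downstream_stalls):
        raise ValueError("enable and downstream_stalls must have equal length")
    if sum(enable) > 1:
        # Assumption of this sketch: reject encodings that are not one-hot.
        raise ValueError("enable signals must be one-hot encoded")
    # Only the stall of the selected downstream stage reaches the branch.
    stall = any(e and s for e, s in zip(enable, downstream_stalls))
    # The valid signal is masked so that only the selected path sees it.
    valid_out = [valid and e for e in enable]
    return stall, valid_out
```

A stall on a non-selected path is ignored in this sketch, consistent with the statement that the branch stage stalls only if the selected downstream stage is stalled.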

FIG. 12 shows an example of an application of the ISP embodiment of the present invention to a 2 to 1 join stage 390. Generally, a pipeline join stage is an N to 1 merger that concatenates data from N upstream stages to one downstream stage. The join stage must wait until data is valid in all upstream stages before concatenating and propagating the data to the downstream stage. A join stage synchronizes and aligns data streams from multiple pipelines. Since data in different upstream stages can become valid at different times, any stage that contains valid data must be stalled until all stages have valid data that can pass to the downstream stage. If the join stage stalls, e.g., because valid data has not yet reached the join stage, all upstream stages must stall. The valid data and stall signal logic for an N to 1 join stage must satisfy:

valid = valid[1] AND . . . AND valid[N]
stall[i] = (NOT valid) OR stall
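As a minimal sketch of the join relations above, and assuming Python booleans and zero-based lists purely for illustration, the logic can be written as follows. The name `join_signals` is hypothetical.

```python
from typing import List, Tuple


def join_signals(upstream_valids: List[bool],
                 stall: bool) -> Tuple[bool, List[bool]]:
    """Combinational logic of an N to 1 join stage.

    upstream_valids: valid[1..N] from the N upstream stages (zero-based).
    stall: stall signal arriving from the single downstream stage.
    Returns (valid, stall_out), where stall_out[i] goes to upstream stage i.
    """
    # The joined data is valid only when every upstream stage has valid data.
    valid = all(upstream_valids)
    # Every upstream stage sees the same stall: either the join is still
    # waiting for some input, or the downstream stage is stalled.
    stall_out = [(not valid) or stall for _ in upstream_valids]
    return valid, stall_out
```

An upstream stage holding a hole also sees the stall asserted in this sketch; in the ISP embodiment such a stage is not actually paused, because its own valid data signal gates the stall locally.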

FIG. 13 shows an example of an application of the ISP embodiment of the present invention to a 1-of-2 to 1 select stage 400 where stage 2 has priority over stage 1. Generally, a pipeline select stage is a 1-of-N to 1 selector that propagates data from one of N upstream stages to one downstream stage, essentially providing a basic if-then-else multiplexor function. A select stage waits until data is valid in at least one of the upstream stages. One stage is then chosen through priority based selection and valid data from the selected stage propagates to the downstream stage. Every other upstream stage that contains valid data must stall until it is selected. The data, valid data and stall signal logic for a 1-of-N to 1 select stage, where a higher index i indicates a higher priority, must satisfy:

valid = valid[1] OR . . . OR valid[N]
stall[i] = stall OR ((i < N) AND (valid[i+1] OR . . . OR valid[N]))
data = if (valid[N]) data[N] elsif . . . elsif (valid[1]) data[1]

A select stage also acts as a priority arbiter deciding which upstream stage wins the arbitration and which competing stages, if any, must stall. State based selection, rather than priority selection, can be implemented through state machines.
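The priority select relations above can be illustrated with the following sketch, in which the stall relation is evaluated once per upstream stage. The name `select_signals`, the zero-based indexing (so that the last list element has the highest priority) and the use of `None` when no input is valid are assumptions of this example only.

```python
from typing import List, Optional, Tuple, TypeVar

T = TypeVar("T")


def select_signals(upstream_valids: List[bool], data: List[T],
                   stall: bool) -> Tuple[bool, List[bool], Optional[T]]:
    """Combinational logic of a 1-of-N to 1 priority select stage.

    upstream_valids, data: valid[1..N] and data[1..N], stored zero-based;
    a higher index means a higher priority.
    stall: stall signal arriving from the downstream stage.
    Returns (valid, stall_out, selected_data).
    """
    n = len(upstream_valids)
    valid = any(upstream_valids)
    # Stage i stalls if downstream stalls or any higher priority stage has
    # valid data. For the highest priority stage the slice is empty, which
    # plays the role of the (i < N) guard.
    stall_out = [stall or any(upstream_valids[i + 1:]) for i in range(n)]
    # Priority multiplexor: take data from the highest valid index.
    selected = None
    for i in reversed(range(n)):
        if upstream_valids[i]:
            selected = data[i]
            break
    return valid, stall_out, selected
```

When both inputs of a two input select hold valid data, this sketch forwards the data of the higher priority stage and stalls only the lower priority stage, matching the arbitration behavior described for stage 400.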

FIG. 14 shows an example of an application of the ISP embodiment of the present invention to multicycle pipeline 410, in this example an N-cycle circular pipeline structure (a ring) with an input stage 412 and an output stage 414 for reading in data from, and writing out data to, an environment. The input stage 412 of the ring 416 is implemented as a select stage 418 and the output stage 414 is implemented as a branch stage. This pipeline example allows multiple multicycle computations to be interleaved in the ring 416 for maximal throughput. Every cycle the feedback stage 420 input to the select stage 418 is checked for valid data. If the feedback stage 420 input is not valid, new data is read into the ring 416 from the input stage 412, if available. In the branch stage 422, the data path logic determines if the current data needs to continue iterating through stage 420 to the ring 416, or if it should be written to the output stage 414 and generates an enabling signal (not shown), accordingly.

Accordingly, the present invention has application to custom pipeline structures and behaviors by providing suitable logic for generating appropriate valid data and stall signals. In particular, logic functions for the valid data and stall signals can be described in any well known specification language such as VHDL or Verilog, and then synthesized to a gate netlist using standard synchronous synthesis tools.

ISP Storage Properties

Advantageously, preferred embodiment pipelines can store more data than what was heretofore possible in synchronous pipelines and queues. A typical N-stage prior art synchronous first-in, first-out (FIFO) register can store up to N/2 data items. When the FIFO contains no more than N/2 data items (i.e., it has an occupancy less than or equal to N/2), the latency of a preferred embodiment ISP FIFO and a normal synchronous pipeline is substantially the same. However, while the (N/2+1)st data item would stall the prior art FIFO, a preferred embodiment ISP FIFO continues accepting inputs past N/2 valid data items, storing up to N data items before being unable to accept input data items. Thus, with between N/2 and N items occupancy, the latency through the ISP FIFO is directly proportional to the occupancy because ISP storage capacity and latency vary dynamically with the input/output rate of data items.

Therefore, because the ISP of the present invention has double the effective storage capacity of prior art pipelines, ISP queues may be considerably smaller than normal state of the art queues and still provide more storage capacity. Thus, ISP queues save significant area and power at the same average performance. The elastic storage properties can also be used advantageously in more general pipeline structures where the extra storage capacity may reduce or eliminate the need for extra pipeline buffer stages, e.g., FIG. 1B. In particular, the elastic storage can provide the staging latches needed to stall high frequency pipelines, saving power, area, and delay.

A first-in, first-out register is a pipelined structure in which data items are queued. Data is taken out of the first-in, first-out register in the order it was inserted. A queue structure is a generalized version of the first-in, first-out register where data is not necessarily taken out in the same order it was inserted. In the most general concept of a queue, data can be inserted in any place in the queue at any time and taken out from any place in the queue at any time. Examples of queue structures are last-in, first-out registers, and issue queue registers. The ability of ISP pipelines to double the effective storage capacity is also applicable to such general queue structures.

CONCLUSIONS

Advantageously, the ISP embodiment significantly reduces clock power consumption in high frequency, high performance microprocessors, even further than the ESP embodiment. The ISP embodiment provides a structured and well defined approach to fine-grained clock gating at the pipeline stage (or individual latch-macro) level using the preferred valid/stall handshake protocol to determine when and whether the stage should be clocked. Stages are clocked only when the input contains valid data and the output is not experiencing a stall (data hold). The ISP embodiment provides a designer friendly approach for specifying and implementing clock gating to achieve the finest granularity of clock gating yet realized, i.e., at the pipeline stage (latch-macro) level and is compatible with synchronous design methodologies that support clock gating.

Thus, the ISP embodiment extends the locally stalled pipelines of the ESP embodiment to provide optimal local clock gating for synchronous pipelines, providing a practical and cost effective clock gating technique based on both valid data and stall conditions. The present invention has application to generalized pipeline structures and may be implemented with two phase, pulsed, pre-charge and other appropriate latches. In modern microprocessors, clock power is estimated to be reduced by up to a factor of 5 relative to clock power in prior art non-gated designs. The amount of power savings of course varies depending on the microarchitecture used and what program is running.

Furthermore, by temporarily storing data in both master and slave stages during stalls, the present invention overcomes the classic overwritten data problem normally encountered when progressively stalling synchronous pipelines. This is key for using stallable pipelines at very high clock frequencies.

In summary, the present invention, and especially the ISP preferred embodiment, provides a significant design effort reduction; the ISP embodiment provides a natural, clearly defined and structured approach to clock gating based on well known handshake concepts. Handshake based interlocking enables direct integration of asynchronous pipeline segments in synchronous pipelines with minimal control logic redesign. Clock power is minimized, especially with gating the clock at the stage (latch-macro) level based on both invalid data and stall (data hold) conditions. The present invention is very flexible; clock gating based on valid/stall handshaking protocols can be applied in any combination (only valid, only stall, both valid and stall) and at any level of granularity (unit, pipeline stage, latch-macro, and anything between), so that the designer has maximum flexibility in deciding to what extent to gate the clock. Because gating decisions are made local to each latch macro, slack may be reduced on valid/stall signals, enabling progressive stalling of high frequency pipelines without having to introduce staging latches and saving additional power by allowing earlier clock gating. The present invention increases effective storage capacity; the elastic storage properties of preferred embodiment pipelines allow such a pipeline to hold up to twice as many data items (one data item in each of the master and slave) as a typical prior art pipeline. In particular, this increases storage capacity in queue structures. Also, data path area, power, and delay are improved by eliminating the need for data hold muxes. Finally, preferred embodiment pipelines are fully testable using state of the art testing techniques. Although data is stored in both master and slave stages, it is fully testable using, for example, level sensitive scan design (LSSD) techniques without additional scan latches or logic structures.

While the invention has been described in terms of several (example) preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

1. A synchronous integrated circuit comprising: a global clock; a synchronous pipeline clocked by said global clock, said synchronous pipeline including a plurality of register stages, data propagating through said synchronous pipeline entering a first register stage and passing through multiple downstream register stages; and each register stage of said synchronous pipeline receiving a stage stall signal, said stage stall signal latching responsive to said global clock and selectively stalling said each register stage, said latched stall signal being said stage stall signal for an upstream pipeline stage, each of said plurality of register stages being individually stalled by a downstream stage stall signal, upstream stages remaining unstalled when a stall is propagating through downstream stages until the stall propagates upstream to remaining unstalled stages individually as a respective stage stall signal is provided to the respective unstalled upstream stage.
 2. A synchronous integrated circuit as in claim 1, wherein said each register stage is selectively clocked by a local clock and further comprises: a stall bit latch latching said stage stall signal responsive to said global clock and providing said latched stall signal; and a local clock generator receiving said global clock and clocking said register stage responsive to said stage stall signal.
 3. A synchronous integrated circuit as in claim 1, wherein alternate pipeline stages latch at alternate clock phases.
 4. A synchronous integrated circuit as in claim 1, wherein said each register stage is a master/slave stage.
 5. A synchronous integrated circuit as in claim 1, wherein said each register stage is a pulsed latch stage.
 6. A synchronous integrated circuit as in claim 1, wherein said synchronous pipeline comprises a first-in, first-out register.
 7. A synchronous integrated circuit as in claim 1, wherein at least one pipeline stage receives a stall indication from logic driven by said at least one pipeline stage, said stall indication being said stage stall signal to said at least one pipeline stage.
 8. A synchronous integrated circuit as in claim 7, wherein at least one second pipeline stage provides a stall indication to logic driving said at least one second pipeline stage.
 9. A microprocessor comprising a synchronous integrated circuit as in claim 8.
 10. A synchronous integrated circuit comprising: a common global clock; a synchronous pipeline clocked by said common global clock; and said synchronous pipeline including a plurality of pipeline stages, data propagating through said synchronous pipeline entering a first stage and passing through multiple downstream stages, each of said plurality of stages being individually stalled by a downstream stall signal, upstream stages remaining unstalled when a stall is propagating through downstream stages until the stall propagates upstream to each remaining stage individually as a respective stall signal provided to stall the respective upstream unstalled stage, each of said plurality of pipeline stages comprising: a register stage selectively clocked by a local clock, a stall bit latch latching a stall signal responsive to said common global clock and providing said stall bit as said stall signal to an upstream register stage, and a local clock generator receiving said common global clock and clocking said register stage responsive to said stall signal.
 11. A synchronous integrated circuit as in claim 10, wherein alternate said pipeline stages latch at alternate global clock phases, said stall bit latch at each said register stage latching coincident with an adjacent said register stage.
 12. A synchronous integrated circuit as in claim 10, wherein each said register stage is a master/slave stage.
 13. A synchronous integrated circuit as in claim 10, wherein each said register stage is a pulsed latch stage.
 14. A synchronous integrated circuit as in claim 10, wherein said plurality of register stages are stages of a first-in, first-out register.
 15. A synchronous integrated circuit as in claim 10, wherein at least one register stage of said plurality of pipeline stages receives a stall indication from logic driven by said at least one register stage, said stall indication being said stall signal to said at least one register stage.
 16. A synchronous integrated circuit as in claim 15, wherein at least one second pipeline stage provides a stall indication to logic driving said at least one second pipeline stage.
 17. A microprocessor comprising a synchronous integrated circuit as in claim 16.
 18. A synchronous integrated circuit comprising: a global clock; an interlocked synchronous pipeline including a plurality of register stages clocked by said global clock, data propagating through said interlocked synchronous pipeline entering a first register stage and passing through multiple downstream register stages, upstream stages remaining unstalled when a stall is propagating through downstream stages until the stall propagates upstream individually, stage by stage, to each remaining stage; and each register stage of said interlocked synchronous pipeline only passing valid data, said each register stage passing valid data being selectively stalled individually responsive to a downstream stall signal and generating a latched stall signal as said stall signal to an upstream pipeline stage.
 19. A synchronous integrated circuit as in claim 18, wherein said each register stage is selectively clocked by a local clock and further comprises: a data valid latch latching a data valid input signal responsive to said global clock and providing a data valid output, said data valid output propagating to a next stage; a stall bit latch latching said stall signal responsive to said global clock and providing said latched stall signal; and a local clock generator receiving said global clock and clocking said register stage responsive to said data valid signal and stalling said register stage responsive to said data valid input signal and said stall signal.
 20. A synchronous integrated circuit as in claim 19, wherein alternate pipeline stages latch at alternate clock phases.
 21. A synchronous integrated circuit as in claim 19, wherein said each stage is a master/slave stage.
 22. A synchronous integrated circuit as in claim 19, wherein said each stage is a pulsed latch stage.
 23. A synchronous integrated circuit as in claim 19, wherein said interlocked synchronous pipeline comprises a first-in, first-out register.
 24. A synchronous integrated circuit as in claim 18, wherein at least one pipeline stage receives a stall indication from logic driven by said at least one pipeline stage, said stall indication being said stall signal to said at least one pipeline stage.
 25. A synchronous integrated circuit as in claim 24, wherein at least one second pipeline stage provides a stall indication to logic driving said at least one second pipeline stage.
 26. A microprocessor comprising a synchronous integrated circuit as in claim 25.
 27. A synchronous integrated circuit comprising: a global clock; and an interlocked synchronous pipeline including a plurality of stages, data propagating through said interlocked synchronous pipeline entering a first stage and passing through multiple downstream stages, each of said plurality of stages being individually stalled by a downstream stall signal, upstream stages remaining unstalled when a stall is propagating through downstream stages until the stall propagates upstream individually, stage by stage, to each remaining stage, each stage of said interlocked synchronous pipeline comprising: a register stage selectively clocked by a local clock, a data valid latch latching a data valid input signal responsive to said global clock and providing a data valid output, said data valid output propagating to an upstream stage, a stall latch latching said stall signal responsive to said data valid input and said global clock, said stall latch providing said latched stall signal as said stall signal to said upstream stage, and a local clock generator receiving said global clock and clocking said register stage responsive to said data valid signal and stalling said register stage responsive to said data valid signal and said stall signal.
 28. A synchronous integrated circuit as in claim 27, wherein alternate pipeline stages latch at alternate clock phases.
 29. A synchronous integrated circuit as in claim 28, wherein said each stage is a master/slave stage.
 30. A synchronous integrated circuit as in claim 28, wherein said each stage is a pulsed latch stage.
 31. A synchronous integrated circuit as in claim 28, wherein said interlocked synchronous pipeline comprises a first-in, first-out register.
 32. A synchronous integrated circuit as in claim 28, wherein at least one pipeline stage receives a stall indication from logic driven by said at least one pipeline stage, said stall indication being said stall signal to said at least one pipeline stage.
 33. A synchronous integrated circuit as in claim 32, wherein at least one second pipeline stage provides a stall indication to logic driving said at least one second pipeline stage.
 34. A microprocessor comprising a synchronous integrated circuit as in claim 33.